Structural Equation Modeling
What you did since the beginning of the year (with link functions or not) was something like this \[
y = X\beta + \epsilon
\] Where \(y\) is the response variable, \(X\) the set of predictors and \(\epsilon\) the error term.
These models:

- assume that all variables are directly observed/manifest
- allow measurement errors only in endogenous variables
- are just particular cases of SEM
In fact, the structural model of a SEM (i.e., excluding latent variables) can be expressed with the following equation: \[
Y = X^\ast B' + \zeta
\] Where

- \(Y\) is the (n x p) matrix of endogenous variables
- \(X^\ast\) is the n x (p + q) matrix of endogenous and exogenous variables
- \(B\) is the (p + q) x (p + q) coefficient matrix (rows of exogenous variables are all zero, as in the examples below)
- \(\zeta\) is the (n x p) matrix of errors in the equations

This looks pretty similar to the regression formula, but with some matrices!
Univariate regression models are just a special case of this formula where the parameter matrix is full of 0!
y x1 x2 x3
y 0 1 2 3
x1 0 0 0 0
x2 0 0 0 0
x3 0 0 0 0
y x3 x2 x1
y 0 3 2 1
x3 0 0 5 4
x2 0 0 0 6
x1 0 0 0 0
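To make the correspondence concrete, here is a minimal base-R sketch (with toy simulated data, not the course dataset) that builds the coefficient matrix \(B\) for a univariate regression and checks that \(X^\ast B'\) reproduces the fitted values:

```r
# Toy data (assumed): a univariate regression y ~ x1 + x2 + x3
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 * x1 + 2 * x2 + 3 * x3 + rnorm(n)

vars  <- c("y", "x1", "x2", "x3")
Xstar <- cbind(y, x1, x2, x3)   # endogenous + exogenous variables

# Coefficient matrix: only the row of the endogenous variable is non-zero
fit <- lm(y ~ 0 + x1 + x2 + x3)
B   <- matrix(0, 4, 4, dimnames = list(vars, vars))
B["y", names(coef(fit))] <- coef(fit)
round(B, 2)

# The matrix form X* B' gives back the regression's fitted values
all.equal(as.numeric((Xstar %*% t(B))[, "y"]), as.numeric(fitted(fit)))
```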
LET'S TRY TO FIT AND COMPARE REGRESSION MODELS
Imagine you want to predict scores in the test we will do at the end of this course (\(y\)), based on your prior statistical knowledge (\(x_1\)) and interest (\(x_2\)):

- First, define the model
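A minimal sketch of the call that produces the summary below (the data here are simulated stand-ins, not the actual course dataset):

```r
# Simulated stand-in for the course data (assumed effect sizes)
set.seed(123)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 0.5 * d$x1 + 0.1 * d$x2 + rnorm(100)

# Define and fit the regression model, then inspect it
fit_lm <- lm(y ~ x1 + x2, data = d)
summary(fit_lm)
```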
Call:
lm(formula = y ~ x1 + x2, data = d)
Residuals:
Min 1Q Median 3Q Max
-2.8022 -0.6244 -0.0259 0.7150 1.8090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0251 0.1009 -0.25 0.80
x1 0.4846 0.1172 4.14 7.5e-05 ***
x2 0.1226 0.1015 1.21 0.23
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.01 on 97 degrees of freedom
Multiple R-squared: 0.162, Adjusted R-squared: 0.145
F-statistic: 9.36 on 2 and 97 DF, p-value: 0.000191
lavaan (latent variable analysis) is actually THE package for SEM. You can use it to estimate a wide family of latent variable models, including: factor analysis, structural equation, longitudinal, multilevel, latent class, item response, and missing data models…
..But also simple regressions
lavaan 0.6-19 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 4
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
[...]
[...]
Regressions:
Estimate Std.Err z-value P(>|z|)
y ~
x1 0.485 0.115 4.199 0.000
x2 0.123 0.100 1.227 0.220
Intercepts:
Estimate Std.Err z-value P(>|z|)
.y -0.025 0.099 -0.253 0.800
Variances:
Estimate Std.Err z-value P(>|z|)
.y 0.987 0.140 7.071 0.000
R-Square:
Estimate
y 0.162
[...]
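The same regression in lavaan: a sketch of the call that would produce output like the one above (again with simulated stand-in data; `meanstructure = TRUE` is assumed here so that the intercept is estimated):

```r
library(lavaan)

# Simulated stand-in data (assumed, as above)
set.seed(123)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 0.5 * d$x1 + 0.1 * d$x2 + rnorm(100)

m   <- "y ~ x1 + x2"                          # same formula syntax as lm
fit <- sem(m, data = d, meanstructure = TRUE)
summary(fit, rsquare = TRUE)

# The model matrices shown above come from inspect()
inspect(fit, what = "est")
```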
QUESTIONS? COMMENTS? What about lm?
y x1 x2
y 0 2 3
x1 0 0 0
x2 0 0 0
y x1 x2
y 0 0.485 0.123
x1 0 0.000 0.000
x2 0 0.000 0.000
y x1 x2
y 4
x1 0 0
x2 0 0 0
y x1 x2
y 0.987
x1 0.000 0.741
x2 0.000 0.014 0.988
As you can see, the regression syntax of lavaan is actually the same as lm, but there is much more in lavaan.
Model specification syntax:

| Syntax | Function | Example |
|---|---|---|
| `~` | Regress onto | Regress B onto A: `B ~ A` |
| `~~` | Residual (co)variance | Variance of A: `A ~~ A`; covariance of A and B: `A ~~ B` |
| `=~` | Define a reflective LV | F1 is defined by items 1-4: `F1 =~ i1 + i2 + i3 + i4` |
| `<~` | Define a formative LV | F1 is defined by items 1-4: `F1 <~ i1 + i2 + i3 + i4` |
| `:=` | Define non-model parameters | `u2 := x + y` |
| `*` | Label or fix a parameter | `Z ~ b*X` labels the regression coefficient as `b` |
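Putting the operators together, a hypothetical model string (all variable names here are made up for illustration) might look like:

```r
# Hypothetical model syntax combining the operators above
m <- "
  # reflective latent variable (=~)
  F1 =~ i1 + i2 + i3 + i4
  # regression (~) with a labeled coefficient (b*)
  Z ~ b*F1
  # residual covariance (~~)
  i1 ~~ i2
  # user-defined, non-model parameter (:=)
  b2 := b^2
"
```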
| Command | Function |
|---|---|
| `sem()` / `cfa()` | Fit the SEM model (`cfa` is nested in `sem`… which is nested in `lavaan`) |
| `fitMeasures()` | Return fit indices of the SEM model |
| `inspect()` | Inspect/extract information that is stored in a fitted model |
| `lavPredict()` | Compute estimated latent scores |
| `lavTestLRT()` | Compare (nested) lavaan models |
| `modificationIndices()` | Compute the modification indices of a model |
| `parameterEstimates()` | Parameter estimates of a latent variable model |
| `parameterTable()` | Show the table of the parameters of a fitted model |
| `simulateData()` | Simulate data starting from a lavaan model syntax |
Just a very easy exercise to start practicing with the lavaan syntax. The example is similar to the one above.
The dataset “Exercise1.csv” contains:

- N = 1100 participants
- 5 variables:
  - gender: coded 1;2
  - age: between 13 and 19
  - anxiety
  - nevroticism
  - math

We want to know whether anxiety and nevroticism have an effect on math achievement. Additionally, is there any effect of the demographic variables?
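A sketch of the code that would produce the output below; the `sem()` call itself is not shown in the slides, and the simulated data here are an assumed stand-in for "Exercise1.csv":

```r
library(lavaan)

# Stand-in data (assumed); in the exercise you would read.csv("Exercise1.csv")
set.seed(1)
dE1 <- data.frame(age         = rnorm(723),
                  anxiety     = rnorm(723),
                  nevroticism = rnorm(723),
                  gender      = rbinom(723, 1, 0.5) + 1)
dE1$math <- 0.05 * dE1$age + 0.35 * dE1$anxiety -
            0.15 * dE1$nevroticism + rnorm(723, sd = 0.5)

mE1   <- "math ~ age + anxiety + nevroticism + gender"
fitE1 <- sem(mE1, data = dE1)
summary(fitE1, standardized = TRUE)
```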
lavaan 0.6-19 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 5
Number of observations 723
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
[...]
[...]
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
math ~
age 0.052 0.009 5.452 0.000 0.052 0.170
anxiety 0.357 0.021 16.878 0.000 0.357 0.591
nevroticism -0.156 0.021 -7.354 0.000 -0.156 -0.257
gender 0.020 0.039 0.518 0.604 0.020 0.016
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.math 0.263 0.014 19.013 0.000 0.263 0.698
WE JUST ANALYZED NEUROTICISM AND ANXIETY. DO WE REALLY BELIEVE THEY ARE ON THE SAME LEVEL?
Path models are just pictorial representations of theoretical relationships between variables.
Representations (and simulations) should be the starting point of almost every study.
Based on these representations, we can build a statistical model and estimate the theoretical paths.
This, of course, is possible in lavaan.
For example, do we really believe that anxiety and nevroticism could stay on the same level? Isn’t there a theoretical effect of one on the other?
Code: We can write the additional regression just by adding one line to the model (plus some additional things for indirect and total effects):
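A possible sketch of that addition (the labels `b1`, `b2`, `b3` and the exact choice of paths are assumptions, not the slides' exact code):

```r
# Assumed sketch: anxiety now mediates the effect of nevroticism on math
m <- "
  math    ~ b1*anxiety + b2*nevroticism + age + gender
  anxiety ~ b3*nevroticism                # the added theoretical path

  # non-model parameters for the indirect and total effect of nevroticism
  indirect := b3 * b1
  total    := b2 + b3 * b1
"
```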
We have collected data from 1100 Italian cities.
Research question: to what extent do economic factors, education, and environmental policies influence the quality of life of Italian people? Are these relationships mediated by social cohesion?

- gdp
- education
- environment
- wellbeing
- cohesion

The model can be expressed with the following equations: \[
\begin{cases}
\text{cohesion} & = \beta_{13} X_{\text{gdp}} + \beta_{14} X_{\text{education}} + \beta_{15} X_{\text{environment}} + \zeta_1 \\
\text{wellbeing} & = \beta_{23} X_{\text{gdp}} + \beta_{24} X_{\text{education}} + \beta_{25} X_{\text{environment}} + \beta_{21} X_{\text{cohesion}} + \zeta_2
\end{cases}
\]
There is an effect of cohesion, which sits in the middle of the model as a mediator.
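The two equations translate almost literally into lavaan syntax; a sketch (assuming a data frame, here called `cities`, with those five variables):

```r
# Model string mirroring the two structural equations above
m <- "
  cohesion  ~ gdp + education + environment
  wellbeing ~ gdp + education + environment + cohesion
"
# fit <- sem(m, data = cities)   # 'cities' is an assumed data-frame name
```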
[...]
Regressions:
Estimate Std.Err z-value P(>|z|)
cohesion ~
gdp -0.324 0.030 -10.722 0.000
education 0.178 0.030 5.883 0.000
environment 0.453 0.030 15.177 0.000
wellbeing ~
gdp 0.371 0.030 12.372 0.000
education 0.154 0.029 5.304 0.000
environment -0.132 0.031 -4.251 0.000
cohesion 0.678 0.028 23.821 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.cohesion 1.004 0.043 23.452 0.000
.wellbeing 0.896 0.038 23.452 0.000
cohesn wllbng gdp eductn envrnm
cohesion 1.004
wellbeing 0.000 0.896
gdp 0.000 0.000 1.032
education 0.000 0.000 0.173 1.035
environment 0.000 0.000 0.038 -0.065 1.029
cohesn wllbng gdp eductn envrnm
cohesion 0.000 0 -0.324 0.178 0.453
wellbeing 0.678 0 0.371 0.154 -0.132
gdp 0.000 0 0.000 0.000 0.000
education 0.000 0 0.000 0.000 0.000
environment 0.000 0 0.000 0.000 0.000
It is composed of the actual residual covariance matrix \(\Psi\)
And the fitted/reproduced covariance matrix \(\hat{\Sigma}(\theta)\)
cohesion wellbeing gdp education environment
cohesion 1.00 0.53 -0.25 0.08 0.38
wellbeing 0.53 1.00 0.17 0.24 0.14
gdp -0.25 0.17 1.00 0.17 0.04
education 0.08 0.24 0.17 1.00 -0.06
environment 0.38 0.14 0.04 -0.06 1.00
N = 483
m <- "
lifeSatisfaction ~ .05*attachment + .25*selfEsteem + .40*parentalSupport + .30*salary
selfEsteem ~ .40*parentalSupport + .20*attachment
attachment ~~ .30*parentalSupport
"
m1 <- "
lifeSatisfaction ~ selfEsteem + salary #+ parentalSupport
selfEsteem ~ parentalSupport + attachment
#attachment ~~ parentalSupport
"
E2 <- simulateData(m, sample.nobs = N, seed = 12)

The dataset “Exercise2.csv” contains:
We want to fit this model (also calculate indirect effects):
mE2 <- "
#we name (_*variable) the parameters of interest
lifeSatisfaction ~ a*selfEsteem + salary
selfEsteem ~ b*parentalSupport + c*attachment
attachment ~ salary
parentalSupport ~ salary
#and then define the non-model parameters
indAttSelf := c*a
indSuppSelf := b*a
"
fitE2 <- sem(mE2, data = E2) #, se = "bootstrap")

lavaan 0.6-19 ended normally after 10 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 10
Number of observations 483
Model Test User Model:
Test statistic 114.172
Degrees of freedom 4
P-value (Chi-square) 0.000
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
[...]
[...]
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
lifeSatisfaction ~
selfEsteem (a) 0.373 0.044 8.379 0.000 0.373 0.349
salary 0.234 0.049 4.805 0.000 0.234 0.200
selfEsteem ~
prntlSpprt (b) 0.360 0.049 7.349 0.000 0.360 0.314
attachment (c) 0.145 0.046 3.177 0.001 0.145 0.136
attachment ~
salary -0.009 0.047 -0.198 0.843 -0.009 -0.009
parentalSupport ~
salary 0.001 0.043 0.025 0.980 0.001 0.001
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.lifeSatisfactn 1.112 0.072 15.540 0.000 1.112 0.838
.selfEsteem 1.028 0.066 15.540 0.000 1.028 0.883
.attachment 1.016 0.065 15.540 0.000 1.016 1.000
.parentalSupprt 0.885 0.057 15.540 0.000 0.885 1.000
[...]
The indirect effects estimated with lavaan in this way are just a multiplication of the parameters. You can apply bootstrap procedures to obtain more robust estimates and standard errors!
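A sketch of the bootstrap refit (continuing from `mE2` and `E2` defined above; the number of replicates is an assumption, 1000+ is a common choice):

```r
# Refit Exercise 2 with bootstrap standard errors
fitB <- sem(mE2, data = E2, se = "bootstrap", bootstrap = 1000)

# Percentile bootstrap CIs, including the := defined parameters
parameterEstimates(fitB, boot.ci.type = "perc")
```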
[...]
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indAttSelf 0.054 0.018 2.971 0.003 0.054 0.047
indSuppSelf 0.134 0.024 5.525 0.000 0.134 0.110
You surely remember this slide from before:
*’This is the goal of a good model: to reproduce, from a set of theoretical associations/effects (aka the covariance matrix), the original covariance matrix.
Formally:
\[ H_0 : \hat{\Sigma}(\theta) = \Sigma \]
where \(\Sigma\) is the true covariance matrix among model variables, \(\theta\) the parameters vector, and \(\hat{\Sigma}\) the reproduced covariance matrix.’*
It’s time to evaluate whether our model is capable of it. To do it, we mainly compare the two matrices and obtain different fit indices.
Model fit refers to the ability of a model to reproduce the original covariance matrix. Fit indices are the tools we use to estimate how good our model is at reproducing such a matrix. Most fit indices only work with overidentified models, but some (e.g., the residual-based ones) can work with just-identified models as well.
A test statistic T following a \(\chi^2\) distribution can be obtained by multiplying the sample size by the value of the fit function. The df will be equal to the amount of non-redundant information minus the number of estimated parameters (remember?).
This is rarely used because of its assumptions:

- independent observations
- the non-standardized sample covariance matrix is used
- N is sufficiently large
- manifest endogenous variables follow a multivariate normal distribution

…and because it tends to reject \(H_0\), especially with large sample sizes.
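In lavaan, the chi-square exact-fit test can be requested from `fitMeasures()`; a sketch using lavaan's built-in `HolzingerSwineford1939` dataset as a stand-in, with an assumed small overidentified CFA:

```r
library(lavaan)

# A small overidentified model on lavaan's built-in example data
fit <- cfa("visual =~ x1 + x2 + x3
            speed  =~ x7 + x8 + x9",
           data = HolzingerSwineford1939)

# T statistic, its degrees of freedom, and the exact-fit p-value
fitMeasures(fit, c("chisq", "df", "pvalue"))
```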
These include:

- Comparative Fit Index (CFI)
- Normed Fit Index (NFI)
- Tucker-Lewis Index (TLI)

These indices compare the user model with a baseline model (the worst model).
lavaan 0.6-19 ended normally after 10 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 10
Number of observations 483
Model Test User Model:
Test statistic 114.172
Degrees of freedom 4
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 276.610
Degrees of freedom 10
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.587
Tucker-Lewis Index (TLI) -0.033
[...]
\[
\text{CFI} = \frac{\delta(\text{Baseline}) - \delta(\text{Model})}{\delta(\text{Baseline})}
\]
With \(\delta\) being the difference between \(\chi^2\) and \(df\), and \(\delta(\text{Saturated}) = 0\)
These include:

- Information Criteria (AIC, BIC, SABIC)
- Noncentrality Parameter-Based Indexes (RNI, Mc/MFI)
- RMSEA: Root Mean Square Error of Approximation

These indices favor parsimony and penalize complex models.
Among these, the RMSEA is probably the most used index. It assesses how well the model approximates the data (as opposed to assessing whether it is an exact fit). It is bounded between 0 and 1, with values closer to 0 (ZERO) indicating better fit.
These include:

- \(\chi^2/df\) ratio
- GFI: Goodness of Fit Index
- AGFI: Adjusted Goodness of Fit Index
- SRMR/RMR: (Standardized) Root Mean Square Residual

GFI and AGFI want to be 1, while the SRMR wants to be 0!
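All of the indices above can be requested by name from `fitMeasures()`; a sketch, again on lavaan's built-in `HolzingerSwineford1939` data with an assumed small CFA:

```r
library(lavaan)

fit <- cfa("visual =~ x1 + x2 + x3
            speed  =~ x7 + x8 + x9",
           data = HolzingerSwineford1939)

# Comparative, parsimony-adjusted, and residual-based indices in one call
fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr", "gfi", "agfi", "aic", "bic"))
```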
Remember that just because your model fits the data, you cannot conclude that your model is correct, nor that the data-generating process follows your hypothesized paths.

- With path analysis, the same fit might be achieved with opposite arrows!
- Test alternative models
- Errors are included in manifest variables… and in the estimates!

As usual, all models are false, but some are useful.
In Exercise 2 we fit a perfect model, all our hypotheses were confirmed, effects were significant with three stars.
We were happy…
Unfortunately, it’s time to modify the model or to accept that something was wrong in our hypotheses.
If you have no clue about the reason why the model doesn’t fit, statistics could help you.
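The table below comes from the modification indices; a sketch of the call (assuming `fitE2`, the Exercise 2 fit from above):

```r
# Modification indices sorted by expected chi-square improvement (mi);
# epc is the expected parameter change if that parameter were freed
modificationIndices(fitE2, sort. = TRUE)
```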
lhs op rhs mi epc sepc.lv sepc.all
14 lifeSatisfaction ~~ selfEsteem 63.000 -1.128 -1.128 -1.055
27 parentalSupport ~ lifeSatisfaction 62.255 0.337 0.337 0.412
21 lifeSatisfaction ~ parentalSupport 56.583 0.404 0.404 0.330
16 lifeSatisfaction ~~ parentalSupport 56.583 0.358 0.358 0.361
25 attachment ~ selfEsteem 48.860 0.945 0.945 1.012
29 parentalSupport ~ attachment 48.570 0.296 0.296 0.317
26 attachment ~ parentalSupport 48.570 0.340 0.340 0.317
19 attachment ~~ parentalSupport 48.570 0.301 0.301 0.317
28 parentalSupport ~ selfEsteem 48.473 2.032 2.032 2.332
22 selfEsteem ~ lifeSatisfaction 38.716 -0.670 -0.670 -0.715
24 attachment ~ lifeSatisfaction 9.730 0.136 0.136 0.155
15 lifeSatisfaction ~~ attachment 5.275 0.112 0.112 0.105
20 lifeSatisfaction ~ attachment 5.275 0.110 0.110 0.097
17 selfEsteem ~~ attachment 0.752 4.478 4.478 4.380
23 selfEsteem ~ salary 0.752 0.041 0.041 0.037
31 salary ~ selfEsteem 0.752 0.038 0.038 0.042
30 salary ~ lifeSatisfaction 0.752 0.103 0.103 0.120
Use it with caution!
While the independent variable is assumed to cause the outcome variable, its total effect (\(c\)) is partially/totally mediated by another intervening variable, the mediator variable.
Note that our independent causes the mediator (\(a\)), and the mediator causes the outcome (\(b\)). The independent can still cause the outcome, but the path might have changed to \(c'\).
When we have a mediation, the effect of the independent variable (X) on the outcome is given by the sum of the direct and indirect effect.
total effect = direct effect + indirect effect
OR
c = c’ + ab
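In lavaan syntax, this decomposition is obtained by labeling the paths; a sketch with assumed variable names `x`, `med`, and `y`:

```r
# Hypothetical mediation model: x -> med -> y, plus the direct path
m <- "
  y   ~ cp*x + b*med     # cp is the direct effect c'
  med ~ a*x              # the a path

  indirect := a*b        # ab
  total    := cp + a*b   # c = c' + ab
"
```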
COMMENTS?
The problem with mediation analyses is that they rely on often-neglected assumptions:
cfi tli srmr rmsea
0.976 0.853 0.031 0.094
Regressions:
Estimate Std.Err z-value P(>|z|)
math ~
anxiety 0.350 0.050 6.979 0.000
sex 0.143 0.106 1.347 0.178
height 0.034 0.048 0.699 0.484
sex ~
anxiety 0.075 0.024 3.092 0.002
height ~
sex 0.984 0.091 10.807 0.000
[...]
Regressions:
Estimate Std.Err z-value P(>|z|)
math ~
anxiety 0.350 0.050 6.979 0.000
sex 0.143 0.106 1.347 0.178
height 0.034 0.048 0.699 0.484
anxiety ~
sex 0.271 0.088 3.092 0.002
height ~
sex 0.984 0.091 10.807 0.000
[...]
About mediation analyses, you can read:

- the webpages by Dave Kenny:
  - The history of mediation
  - Mediation explained and advanced topics
- the publication by Rohrer et al. …but there’s a lot of information